Method and system for speech quality prediction of an audio transmission system

ABSTRACT

Method and system for measuring the transmission quality of an audio transmission system (10). Preprocessing means (12) are present for preprocessing of an input signal (X) and an output signal (Y) to obtain pitch power densities (PPX_(WIRSS)(ƒ)_(n), PPY_(WIRSS)(ƒ)_(n)) for the respective signals. Compensation means (13, 14) are provided for compensation of linear frequency response and time varying gain. Calculation means (13, 14) are present for calculation of loudness densities (LX(ƒ)_(n), LY(ƒ)_(n)) from the compensated pitch power densities, and computation means (15, 16) are provided for computation of a score (Q) indicative of the transmission quality of the system (10) from the loudness densities. The compensation means (13, 14) comprise an iterative loop having at least three calculations of compensations, each calculation comprising one of a calculation of a compensation of linear frequency response and a calculation of a local power scaling factor.

FIELD OF THE INVENTION

The present invention relates to a method and a system for measuring the transmission quality of a system under test, an input signal entered into the system under test and an output signal resulting from the system under test being processed and mutually compared.

PRIOR ART

Such a method and system are known from ITU-T recommendation P.862, “Telephone transmission quality, telephone installations, local line networks - Methods for objective and subjective assessment of quality - Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, ITU-T 02.2001 [8].

Also, the article by J. Beerends et al., “Perceptual Evaluation of Speech Quality (PESQ) The New ITU Standard for end-to-end Speech Quality Assessment Part II - Psychoacoustic Model”, J. Audio Eng. Soc., Vol. 50, no. 10, October 2002, describes such a method and system [9].

A disadvantage of the P.862 method and system is that the standard quality measurement does not correctly compensate for large variations in the frequency response of the system under test or for large differences in local power between the input and output signals. This may result in a poor correlation between the speech quality scores provided by the method and system and the quality of speech as evaluated by test persons.

SUMMARY OF THE INVENTION

The present invention seeks to provide an improvement of the correlation between the perceived quality of speech as measured by the P.862 method and system and the actual quality of speech as perceived by test persons.

According to the present invention, a method according to the preamble defined above is provided, in which the compensation of linear frequency response and time varying gain comprises an iterative loop having at least three calculations of compensations, each calculation comprising one of a calculation of a compensation of linear frequency response and a calculation of a local power scaling factor.

The present invention is based on the understanding that in certain circumstances (presence of noise, presence of large frequency response deviations in the system under test) the existing standardized method does not correctly measure the perceived quality of speech.

If a frequency compensation is calculated in the presence of noise, a wrong estimate of the frequency response function will arise in frequency regions where there is little energy. If a local temporal scaling factor is calculated on a signal that has passed through a system which shows large deviations in the frequency response, the local scaling factor cannot be calculated correctly. Both effects have to be calculated correctly in order to be able to predict the subjectively perceived quality of speech signals.

A correction may be implemented according to the present invention by replacing the calculation of a linear frequency compensation and the calculation of a local power scaling factor by an iterative calculation of the frequency compensation and local scaling factor. By first calculating a rough estimate of the necessary frequency compensation, i.e. by not compensating to the amount that one would normally carry out, one obtains a signal in time from which better estimations can be made regarding the local temporal scaling factor that is necessary for correctly predicting the final perceived quality. After this local scaling calculation one obtains a time signal from which a better estimation can be made for the necessary frequency compensation.

Overall, this will improve the performance of the speech quality prediction using the method according to the invention. In other circumstances, this adaptation of the standardized method and system will not have a negative influence.

The calculation of the local power scaling factor may be implemented as described in the ITU-T Recommendation P.862, or alternatively as described in the applicant's non-prepublished European patent application 02075973 [10], which is included herein by reference.

In a particularly advantageous embodiment, the iterative loop comprises a calculation of a first partial linear frequency compensation and application of the first partial linear frequency compensation to the pitch power density of the input signal, followed by a calculation of a local power scaling factor and application of the local power scaling factor to the pitch power density of the output signal, followed by a calculation of a second partial linear frequency compensation and application of the linear frequency compensation to the partially compensated pitch power density of the input signal. In a further embodiment, the application of the compensations to the pitch power densities of the input and output signal are interchanged, i.e., the first and second partial linear frequency compensations are applied to the pitch power density of the output signal, and the local power scaling factor is applied to the pitch power density of the input signal. These embodiments require only very few changes to the existing standardized P.862 method, while improving its performance.

In a further embodiment, the partial linear frequency compensation is a first estimate which is lower than the linear frequency compensation one would use for correct evaluation of the linear distortion (as prescribed in e.g. the ITU-T Recommendation P.862), e.g. 50% of the amplitude correction of the normal linear frequency compensation. This partial compensation can also be carried out in a frequency dependent manner, e.g. by having limited frequency ranges over which a larger partial compensation is carried out than over other frequency ranges. One can e.g. compensate only frequency response variations as found with close microphone techniques that result in a low frequency boost below about 500 Hz.

In a second aspect, the present invention relates to a system for measuring the transmission quality of an audio transmission system as defined in the preamble above, in which the compensation means comprise an iterative loop having at least three calculations of a compensation, each calculation comprising one of a calculation of a compensation of linear frequency response and a calculation of a local power scaling factor. This system, and the systems as defined in the dependent claims, provides advantages comparable to the advantages of the method as described above.

SHORT DESCRIPTION OF DRAWINGS

The present invention will be discussed in more detail below, using a number of exemplary embodiments, with reference to the attached drawings, in which:

FIG. 1 shows schematically a prior-art PESQ system, disclosed in ITU-T recommendation P.862.

FIG. 2 shows a view of a perceptual model implementation as used in the PESQ system of FIG. 1.

FIG. 3 shows the same PESQ implementation as FIG. 2 which, however, is modified to be fit for executing the method according to an embodiment of the present invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

FIG. 1 shows schematically a known set-up of an application of an objective measurement technique which is based on a model of human auditory perception and cognition, and which follows the ITU-T Recommendation P.862 [8], for estimating the perceptual quality of speech links or codecs. The acronym used for this technique or device is PESQ (Perceptual Evaluation of Speech Quality). It comprises a system or telecommunications network under test 10, hereinafter referred to as system 10 for brevity, and a quality measurement device 11 for the perceptual analysis of speech signals offered. A speech signal X₀(t) is used, on the one hand, as an input signal of the system 10 and, on the other hand, as a first input signal X(t) of the device 11. An output signal Y(t) of the system 10, which in fact is the speech signal X₀(t) affected by the system 10, is used as a second input signal of the device 11. An output signal Q of the device 11 represents an estimate of the perceptual quality of the speech link through the system 10. Since the input end and the output end of a speech link, particularly in the event it runs through a telecommunications network, are remote, use is made in most cases of speech signals X(t) stored in databases for the input signals of the quality measurement device 11. Here, as is customary, speech signal is understood to mean any sound basically perceptible to human hearing, such as speech and tones. The system under test 10 may of course also be a simulation system, which simulates a telecommunications network. The device 11 carries out a main processing step which comprises successively, in a pre-processing section 11.1, a step of pre-processing carried out by pre-processing means 12, in a processing section 11.2, a further processing step carried out by first and second signal processing means 13 and 14, and, in a signal combining section 11.3, a combined signal processing step carried out by signal differentiating means 15 and modelling means 16.
In the pre-processing step the signals X(t) and Y(t) are prepared for the step of further processing in the means 13 and 14, the pre-processing including power level scaling and time alignment operations. The further processing step implies mapping of the (degraded) output signal Y(t) and the reference signal X(t) on representation signals R(Y) and R(X) according to a psycho-physical perception model of the human auditory system. During the combined signal processing step a differential or disturbance signal D is determined by the differentiating means 15 from said representation signals, which is then processed by modelling means 16 in accordance with a cognitive model, in which certain properties of human test subjects have been modelled, in order to obtain the quality signal Q.

In a first step executed by the PESQ system a series of delays between original input and degraded output are computed, one for each time interval for which the delay is significantly different from the previous time interval. For each of these intervals a corresponding start and stop point is calculated. The alignment algorithm is based on the principle of comparing the confidence of having two delays in a certain time interval with the confidence of having a single delay for that interval. The algorithm can handle delay changes both during silences and during active speech parts.

Based on the set of delays that are found, the PESQ system compares the original (input) signal with the aligned degraded output of the device under test using a perceptual model. The key to this process is transformation of both the original and the degraded signals to internal representations (LX, LY), analogous to the psychophysical representation of audio signals in the human auditory system, taking account of perceptual frequency (Bark) and loudness (Sone). This is achieved in several stages: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling.

The internal representation is processed to take account of effects such as local gain variations and linear filtering that may, if they are not too severe, have little perceptual significance. This is achieved by limiting the amount of compensation and making the compensation lag behind the effect. Thus minor, steady-state differences between original and degraded are compensated. More severe effects, or rapid variations, are only partially compensated so that a residual effect remains and contributes to the overall perceptual disturbance. This allows a small number of quality indicators to be used to model all subjective effects. In the PESQ system, two error parameters are computed in the cognitive model; these are combined to give an objective listening quality MOS (Mean Opinion Score). The basic ideas used in the PESQ system are described in the bibliography references [1] to [5].

The Perceptual Model in the Prior-Art PESQ System

In FIG. 2, a part of an implementation of the device 11 (i.e. the perceptual model part) is illustrated, comprising in essence the first and second signal processing means 13 and 14, and the differentiating means 15 as described above.

The perceptual model of a PESQ system, shown in FIG. 2, is used to calculate a distance between the original and degraded speech signal (“PESQ score”). This may be passed through a monotonic function to obtain a prediction of a subjective MOS for a given subjective test. The PESQ score is mapped to a MOS-like scale.

Absolute Hearing Threshold

The absolute hearing threshold P₀(ƒ) is interpolated to get the values at the center of the Bark bands that are used. These values are stored in an array and are used in Zwicker's loudness formula.

The Power and Loudness Scaling Factors

There are arbitrary gain constants following the FFT for time-frequency analysis and in the loudness calculation, meant only for calibrating the system.

IRS-Receive Filtering

If it is assumed that the listening tests were carried out using an IRS (intermediate reference system) receive or a modified IRS receive characteristic in the handset, the necessary filtering to the speech signals is applied in the pre-processing (section 11.1 in FIG. 1), resulting in signals X_(IRSS)(t) and Y_(IRSS)(t).

Computation of the Active Speech Time Interval

If the original and degraded speech files start or end with large silent intervals, this could influence the computation of certain average distortion values over the files. Therefore, an estimate is made of the silent parts at the beginning and end of these files.

Short Term FFT or Time-Frequency Decomposition

The human ear performs a time-frequency transformation. In the PESQ system this is implemented by a short term FFT with overlap between successive time windows (frames). The power spectra (the sum of the squared real and squared imaginary parts of the complex FFT components) are stored in separate real valued arrays for the original and degraded signals. Phase information within a single Hanning window is discarded in the PESQ system and all calculations are based on only the power representations PX_(WIRSS)(ƒ)_(n) and PY_(WIRSS)(ƒ)_(n). The start points of the windows in the degraded signal are shifted over the delay. The time axis of the original speech signal is left as is. If the delay increases, parts of the degraded signal are omitted from the processing, while for decreases in the delay parts are repeated.

Calculation of the Pitch Power Densities

The Bark scale reflects that at low frequencies, the human hearing system has a finer frequency resolution than at high frequencies. This is implemented by binning FFT bands and summing the corresponding powers of the FFT bands with a normalization of the summed parts. The warping function that maps the frequency scale in Hertz to the pitch scale in Bark does not exactly follow the values given in the literature. The resulting signals are known as the pitch power densities PPX_(WIRSS)(ƒ)_(n) and PPY_(WIRSS)(ƒ)_(n).
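
The binning step can be illustrated with a minimal sketch. The band edges below are hypothetical and are not the warping table of P.862, and the normalization is assumed here to be a simple division by the number of FFT bins summed; the function name `pitch_power_density` is likewise introduced only for illustration.

```python
from typing import List, Sequence


def pitch_power_density(power_spectrum: Sequence[float],
                        band_edges: Sequence[int]) -> List[float]:
    """Sum FFT powers per band and normalize by the band width in FFT bins.

    band_edges[i] and band_edges[i + 1] delimit band i (end exclusive).
    Both the edges and the normalization are assumptions of this sketch.
    """
    density = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        if hi <= lo:
            raise ValueError("band edges must be strictly increasing")
        summed = sum(power_spectrum[lo:hi])
        density.append(summed / (hi - lo))  # assumed normalization
    return density


# Hypothetical edges: narrow bands at low frequencies, wider bands higher up.
spectrum = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
edges = [0, 1, 2, 4, 8]
print(pitch_power_density(spectrum, edges))
```

The narrow low-frequency bands keep individual FFT bins apart, while the wide high-frequency bands merge several bins, which mirrors the finer resolution of hearing at low frequencies.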

Compensation of the Original Pitch Power Density (Linear Frequency Response Compensation)

To deal with filtering in the system under test, the original and degraded pitch power densities are averaged over time. This average is calculated over speech active frames only, using time-frequency cells whose power is a certain fraction above the absolute hearing threshold. Per modified Bark bin, a partial compensation factor is calculated from the ratio of the degraded spectrum to the original spectrum. The original pitch power density PPX_(WIRSS)(ƒ)_(n) of each frame n is then multiplied with this partial compensation factor to equalize the original to the degraded signal. This results in an inversely filtered original pitch power density PPX′_(WIRSS)(ƒ)_(n). This partial compensation is used because severe filtering can be disturbing to the listener. The compensation is carried out on the original signal because the degraded signal is the one that is judged by the subjects in an ACR experiment.
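
A minimal sketch of this compensation is given below. Several details are assumptions made for illustration only: a cell is counted when the original power exceeds the threshold, the "partial" nature is modelled by bounding the ratio to `[1/max_ratio, max_ratio]`, and the default bound of 100 is a placeholder rather than a value taken from this text.

```python
from typing import List, Sequence, Tuple

Frames = List[List[float]]


def frequency_compensate(ppx: Frames, ppy: Frames,
                         threshold: Sequence[float],
                         active: Sequence[bool],
                         max_ratio: float = 100.0) -> Tuple[Frames, List[float]]:
    """Scale the original density per band towards the degraded spectrum."""
    factors = []
    for f in range(len(threshold)):
        sum_x = 0.0
        sum_y = 0.0
        for frame_x, frame_y, is_active in zip(ppx, ppy, active):
            # Only speech-active frames and audible cells contribute (assumed rule).
            if is_active and frame_x[f] > threshold[f]:
                sum_x += frame_x[f]
                sum_y += frame_y[f]
        ratio = sum_y / sum_x if sum_x > 0.0 else 1.0
        # Bounding the ratio keeps the compensation partial (assumed bound).
        factors.append(min(max_ratio, max(1.0 / max_ratio, ratio)))
    compensated = [[v * g for v, g in zip(frame, factors)] for frame in ppx]
    return compensated, factors


ppx = [[4.0, 4.0], [4.0, 4.0], [9.0, 9.0]]
ppy = [[2.0, 8.0], [2.0, 8.0], [0.0, 0.0]]
compensated, factors = frequency_compensate(ppx, ppy, [1.0, 1.0], [True, True, False])
print(factors)
```

The third frame is marked inactive, so it does not influence the factors, but it is still multiplied by them, since the compensation is applied to every frame of the original.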

Compensation of the Distorted Pitch Power Density (Time-Varying Gain Compensation)

Short-term gain variations are partially compensated by processing the pitch power densities frame by frame (i.e. local compensation). For the original and the degraded pitch power densities, the sum in each frame n of all values that exceed the absolute hearing threshold is computed. The ratio of the power in the original and the degraded files is calculated and bounded to a predetermined range. A first order low pass filter (along the time axis) is applied to this ratio. The distorted pitch power density in each frame, n, is then multiplied by this ratio, resulting in the partially gain compensated distorted pitch power density PPY′_(WIRSS)(ƒ)_(n).

This partial compensation or calculation of local scaling factor may be implemented using the embodiment described in the applicant's pending, non-prepublished European patent application 02075973.4, which is incorporated herein by reference (see specifically FIG. 3).
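
The frame-by-frame gain compensation can be sketched as follows. The bounds `lower` and `upper`, the filter coefficient `alpha`, the initial filter state of 1.0 and the guard constant `eps` are all assumptions of this sketch; the text above only states that the ratio is bounded to a predetermined range and passed through a first order low pass filter.

```python
from typing import List, Sequence

Frames = List[List[float]]


def gain_compensate(ppx: Frames, ppy: Frames, threshold: Sequence[float],
                    lower: float = 0.0003, upper: float = 5.0,
                    alpha: float = 0.8, eps: float = 1e-12) -> Frames:
    """Multiply each degraded frame by a bounded, smoothed power ratio."""
    smoothed = 1.0  # assumed initial state of the low pass filter
    compensated = []
    for frame_x, frame_y in zip(ppx, ppy):
        # Sum only the cells that exceed the absolute hearing threshold.
        power_x = sum(v for v, t in zip(frame_x, threshold) if v > t)
        power_y = sum(v for v, t in zip(frame_y, threshold) if v > t)
        ratio = (power_x + eps) / (power_y + eps)
        ratio = min(upper, max(lower, ratio))  # bound to the assumed range
        # First order low pass filter along the time axis.
        smoothed = alpha * ratio + (1.0 - alpha) * smoothed
        compensated.append([v * smoothed for v in frame_y])
    return compensated


print(gain_compensate([[4.0, 4.0]], [[2.0, 2.0]], [1.0, 1.0], alpha=0.5))
```

Because the filtered ratio lags behind the instantaneous one, a sudden gain change is only partly removed in the frame where it occurs, leaving a residual that contributes to the disturbance.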

Calculation of the Loudness Densities

After compensation for filtering and short-term gain variations, the original and degraded pitch power densities are transformed to a Sone loudness scale using Zwicker's law [7]:

$LX(f)_{n} = S_{l} \cdot \left( \frac{P_{0}(f)}{0.5} \right)^{\gamma} \cdot \left[ \left( 0.5 + 0.5 \cdot \frac{PPX'_{WIRSS}(f)_{n}}{P_{0}(f)} \right)^{\gamma} - 1 \right]$

with P₀(ƒ) the absolute threshold and S_(l) the loudness scaling factor. Above 4 Bark, the Zwicker power, γ, is 0.23, the value given in the literature. Below 4 Bark, the Zwicker power is increased slightly to account for the so-called recruitment effect. The resulting two-dimensional arrays LX(ƒ)_(n) and LY(ƒ)_(n) are called loudness densities.

Calculation of the Disturbance Density

The signed difference between the distorted and original loudness density is computed. When this difference is positive, components such as noise have been added. When this difference is negative, components have been omitted from the original signal. This difference array is called the raw disturbance density.

The minimum of the original and degraded loudness density is computed for each time frequency cell. These minima are multiplied by 0.25. The corresponding two-dimensional array is called the mask array. The following rules are applied in each time-frequency cell:

-   If the raw disturbance density is positive and larger than the mask value, the mask value is subtracted from the raw disturbance.
-   If the raw disturbance density lies in between plus and minus the magnitude of the mask value, the disturbance density is set to zero.
-   If the raw disturbance density is more negative than minus the mask value, the mask value is added to the raw disturbance density.

The net effect is that the raw disturbance densities are pulled towards zero. This represents a dead zone before an actual time frequency cell is perceived as distorted. This models the process of small differences being inaudible in the presence of loud signals (masking) in each time-frequency cell. The result is a disturbance density as a function of time (window number n) and frequency, D(ƒ)_(n).

This perceptual subtraction of the loudness densities LX(ƒ)_(n) and LY(ƒ)_(n), resulting in the disturbance density D(ƒ)_(n), may be implemented as described with reference to FIG. 4 of the applicant's pending, non-prepublished European patent application 02075973.4, which is incorporated herein by reference.
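
The three masking rules can be sketched directly from the description above. The function name and the list-of-lists layout (one inner list per frame) are choices made for this illustration only.

```python
from typing import List

Frames = List[List[float]]


def disturbance_density(lx: Frames, ly: Frames, mask_scale: float = 0.25) -> Frames:
    """Apply the dead-zone rules to the signed loudness difference per cell."""
    result = []
    for frame_x, frame_y in zip(lx, ly):
        row = []
        for x, y in zip(frame_x, frame_y):
            raw = y - x                     # raw disturbance density
            mask = mask_scale * min(x, y)   # mask value for this cell
            if raw > mask:
                row.append(raw - mask)      # added components
            elif raw < -mask:
                row.append(raw + mask)      # omitted components
            else:
                row.append(0.0)             # inside the dead zone
        result.append(row)
    return result


print(disturbance_density([[4.0, 8.0, 4.0]], [[8.0, 4.0, 4.5]]))
```

In the example, the first cell represents an added component, the second an omitted one, and the third a difference small enough to fall inside the dead zone.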

Cell-Wise Multiplication with an Asymmetry Factor

The asymmetry effect is caused by the fact that when a codec distorts the input signal it will in general be very difficult to introduce a new time-frequency component that integrates with the input signal, and the resulting output signal will thus be decomposed into two different percepts, the input signal and the distortion, leading to clearly audible distortion [2]. When the codec leaves out a time-frequency component the resulting output signal cannot be decomposed in the same way and the distortion is less objectionable. This effect is modelled by calculating an asymmetrical disturbance density DA(ƒ)_(n) per frame by multiplication of the disturbance density D(ƒ)_(n) with an asymmetry factor. This asymmetry factor equals the ratio of the distorted and original pitch power densities raised to the power of 1.2. If the asymmetry factor is less than 3 it is set to zero. If it exceeds 12 it is clipped at that value. Thus only those time frequency cells remain, as non-zero values, for which the degraded pitch power density exceeded the original pitch power density.
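
As an illustration, the asymmetry factor for a single time-frequency cell can be sketched as below. The small constant `eps`, which guards against division by zero, is an assumption of this sketch and is not part of the description above.

```python
def asymmetry_factor(ppy: float, ppx: float, exponent: float = 1.2,
                     floor: float = 3.0, ceiling: float = 12.0,
                     eps: float = 1e-12) -> float:
    """Ratio of degraded to original power, raised to 1.2, zeroed below 3, clipped at 12."""
    h = ((ppy + eps) / (ppx + eps)) ** exponent
    if h < floor:
        return 0.0
    return min(h, ceiling)


def asymmetric_disturbance(d: float, ppy: float, ppx: float) -> float:
    """Asymmetrical disturbance density for one cell."""
    return d * asymmetry_factor(ppy, ppx)


print(asymmetry_factor(2.0, 1.0))              # ratio too small: factor is zeroed
print(asymmetry_factor(100.0, 1.0))            # large ratio: clipped at the ceiling
print(asymmetric_disturbance(2.0, 100.0, 1.0))
```

A cell where the degraded power is only twice the original contributes nothing, while a cell with a strongly added component is weighted by at most 12.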

Aggregation of the Disturbance Densities

The disturbance density D(ƒ)_(n) and asymmetrical disturbance density DA(ƒ)_(n) are integrated (summed) along the frequency axis using two different Lp norms and a weighting on soft frames (having low loudness):

$D_{n} = M_{n} \sqrt[3]{\sum_{f=1}^{N_{b}} \left( \left| D(f)_{n} \right| W_{f} \right)^{3}}$

$DA_{n} = M_{n} \sum_{f=1}^{N_{b}} \left( \left| DA(f)_{n} \right| W_{f} \right)$

with N_(b) the number of Bark bands, M_(n) a multiplication factor, 1/(power of original frame plus a constant)^(0.04), resulting in an emphasis of the disturbances that occur during silences in the original speech fragment, and W_(f) a series of constants proportional to the width of the modified Bark bins. After this multiplication the frame disturbance values are limited to a maximum of 45. These aggregated values, D_(n) and DA_(n), are called frame disturbances.
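
The two frequency-axis norms can be sketched as follows. The value of `constant` in the multiplication factor is not given in the text above, so the default used here is purely hypothetical, as are the band widths in the example.

```python
from typing import Sequence, Tuple


def frame_disturbances(d: Sequence[float], da: Sequence[float],
                       widths: Sequence[float], frame_power: float,
                       constant: float = 1e5,
                       limit: float = 45.0) -> Tuple[float, float]:
    """Aggregate one frame over frequency: L3 norm for D, L1 norm for DA."""
    m_n = 1.0 / (frame_power + constant) ** 0.04  # emphasizes quiet frames
    d_n = m_n * sum((abs(v) * w) ** 3 for v, w in zip(d, widths)) ** (1.0 / 3.0)
    da_n = m_n * sum(abs(v) * w for v, w in zip(da, widths))
    # Frame disturbances are limited to a maximum value.
    return min(d_n, limit), min(da_n, limit)


# With frame_power = 0 and constant = 1 the multiplication factor equals 1.
print(frame_disturbances([3.0, -4.0, 5.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0],
                         frame_power=0.0, constant=1.0))
```

The cubic norm lets the largest per-band disturbance dominate D_(n), whereas the asymmetrical disturbance is simply summed.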

If the distorted signal contains a decrease in the delay larger than 16 ms (half a window) the repeat strategy is modified. It was found to be better to ignore the frame disturbances during such events in the computation of the objective speech quality. As a consequence frame disturbances are zeroed when this occurs. The resulting frame disturbances are called D′_(n) and DA′_(n).

Realignment of Bad Intervals

Consecutive frames with a frame disturbance above a threshold are called bad intervals. In a minority of cases the objective measure predicts large distortions over a minimum number of bad frames due to incorrect time delays observed by the preprocessing. For those so-called bad intervals a new delay value is estimated by maximizing the cross correlation between the absolute original signal and absolute degraded signal adjusted according to the delays observed by the preprocessing. When the maximal cross correlation is below a threshold, it is concluded that the interval is matching noise against noise and the interval is no longer called bad, and the processing for that interval is halted. Otherwise, the frame disturbance for the frames during the bad intervals is recomputed and, if it is smaller, replaces the original frame disturbance. The result is the final frame disturbances D″_(n) and DA″_(n) that are used to calculate the perceived quality.

Aggregation of the Disturbance within Split Second Intervals

Next, the frame disturbance values and the asymmetrical frame disturbance values are aggregated over split second intervals of 20 frames (accounting for the overlap of frames: approx. 320 ms) using L₆ norms, a higher p value than in the aggregation over the speech file length. These intervals also overlap 50 percent and no window function is used.

Aggregation of the Disturbance Over the Duration of the Signal

The split second disturbance values and the asymmetrical split second disturbance values are aggregated over the active interval of the speech files (the corresponding frames), now using L₂ norms. The higher value of p for the aggregation within split second intervals as compared to the lower p value of the aggregation over the speech file is due to the fact that when part of a split second is distorted, that split second loses meaning, whereas if a first sentence in a speech file is distorted the quality of other sentences remains intact.
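
The two-stage time aggregation can be sketched as below. The Lp norm is assumed here to be normalized by the number of values (a mean of p-th powers), which the text above does not specify; the function names are introduced only for this illustration.

```python
from typing import List, Sequence


def lp_norm(values: Sequence[float], p: float) -> float:
    """Lp norm normalized by the number of values (assumed normalization)."""
    if not values:
        raise ValueError("at least one value is required")
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


def aggregate_over_time(frame_disturbance: Sequence[float], interval: int = 20,
                        p_inner: float = 6.0, p_outer: float = 2.0) -> float:
    """L6 within 50 percent overlapping split second intervals, then L2 over them."""
    hop = interval // 2  # 50 percent overlap, no window function
    last_start = max(len(frame_disturbance) - interval, 0)
    split_seconds: List[float] = [
        lp_norm(frame_disturbance[start:start + interval], p_inner)
        for start in range(0, last_start + 1, hop)
    ]
    return lp_norm(split_seconds, p_outer)


steady = [2.0] * 40
burst = [0.0] * 39 + [10.0]
print(aggregate_over_time(steady), aggregate_over_time(burst))
```

With the high inner exponent, a single heavily disturbed frame dominates its split second interval, while the lower outer exponent averages more evenly across the file.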

Computation of the PESQ Score

The final PESQ score is a linear combination of the average disturbance value and the average asymmetrical disturbance value.

The above described PESQ method (as prescribed in the ITU-T Recommendation P.862) has the disadvantage that it cannot deal correctly with speech signals with large variations in frequency response. The frequency response variation compensation and local power scaling compensation are then calculated incorrectly, resulting in a wrong calculation of the speech quality of a system 10.

The present invention is based on the understanding that if a frequency compensation is calculated in the presence of noise, a wrong estimate of the frequency response function will arise in frequency regions where there is little energy. If a local temporal scaling factor is calculated on a signal that has passed through a system which shows large deviations in the frequency response, the local scaling factor cannot be calculated correctly. Both effects have to be calculated correctly in order to be able to predict the subjectively perceived quality of speech signals.

In FIG. 3, a particularly advantageous embodiment of the perceptual model part of the PESQ method is illustrated, corresponding to the illustration of FIG. 2. However, the calculation of the linear frequency compensation and the calculation of the local power scaling factor are different.

The linear frequency response compensation calculation and local power scaling factor calculation are put in an iterative loop. First, a rough estimate of the necessary frequency compensation is calculated. Next, a partial linear frequency compensation is calculated which is lower than the linear frequency compensation one would use for correct evaluation of the linear distortion, e.g. 50% of the amplitude correction of the normal linear frequency compensation. This partial compensation can also be carried out by having limited frequency ranges over which a larger partial compensation is carried out than over other frequency ranges. One can e.g. compensate only frequency response variations as found with close microphone techniques that result in a low frequency boost below about 500 Hz.

By not compensating to the amount that one would normally carry out, one obtains a signal in time PPX′_(WIRSS)(ƒ)_(n) from which better estimations can be made regarding the local temporal scaling factor that is necessary for correctly predicting the final perceived quality. After this local scaling calculation, applied to the degraded signal PPY_(WIRSS)(ƒ)_(n), one obtains a time signal PPY′_(WIRSS)(ƒ)_(n) from which a better estimation can be made for the final necessary frequency compensation. The final frequency compensation (i.e. compensation for the remaining frequency deviations) applied to the partially compensated signal PPX′_(WIRSS)(ƒ)_(n) results in a final signal PPX″_(WIRSS)(ƒ)_(n). The resulting signals PPY′_(WIRSS)(ƒ)_(n) and PPX″_(WIRSS)(ƒ)_(n) are then further processed as described above (warping to loudness scale and subsequent steps).
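
The order of the three calculations can be illustrated with a deliberately simplified sketch. The helper functions omit the hearing threshold, the bounding of ratios and the low pass filter described earlier, and "partial" compensation is interpreted here as raising the per-band power ratio to an exponent `strength` (0.5 meaning half of the correction in dB). That interpretation and all names are assumptions made for illustration, not the implementation of the embodiment.

```python
from typing import List, Tuple

Frames = List[List[float]]


def band_ratios(ppx: Frames, ppy: Frames) -> List[float]:
    """Time-summed degraded power divided by original power, per band."""
    ratios = []
    for f in range(len(ppx[0])):
        sum_x = sum(frame[f] for frame in ppx)
        sum_y = sum(frame[f] for frame in ppy)
        ratios.append(sum_y / sum_x if sum_x > 0.0 else 1.0)
    return ratios


def frequency_step(ppx: Frames, ppy: Frames, strength: float) -> Frames:
    """Apply a (partial) linear frequency compensation to the original."""
    factors = [r ** strength for r in band_ratios(ppx, ppy)]
    return [[v * g for v, g in zip(frame, factors)] for frame in ppx]


def local_scaling_step(ppx: Frames, ppy: Frames) -> Frames:
    """Scale each degraded frame to the power of the matching original frame."""
    scaled = []
    for frame_x, frame_y in zip(ppx, ppy):
        power_y = sum(frame_y)
        scale = sum(frame_x) / power_y if power_y > 0.0 else 1.0
        scaled.append([v * scale for v in frame_y])
    return scaled


def iterative_compensation(ppx: Frames, ppy: Frames,
                           first_strength: float = 0.5) -> Tuple[Frames, Frames]:
    ppx1 = frequency_step(ppx, ppy, first_strength)  # rough estimate: PPX'
    ppy1 = local_scaling_step(ppx1, ppy)             # local power scaling: PPY'
    ppx2 = frequency_step(ppx1, ppy1, 1.0)           # remaining deviations: PPX''
    return ppx2, ppy1


# Toy system: a fixed spectral tilt, plus a gain drop in the second frame.
ppx2, ppy1 = iterative_compensation([[4.0, 4.0], [4.0, 4.0]],
                                    [[2.0, 8.0], [1.0, 4.0]])
print(ppx2, ppy1)
```

In this toy case the degraded signal differs from the original only by a spectral tilt and a per-frame gain, so after the three steps the two compensated densities coincide; any remaining difference in a real signal would be attributed to non-linear or time-varying distortion.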

For the person skilled in the art, it will be clear that further modifications can be made to the present embodiment. The amount of partial compensation can be adapted to the experimental context. Also it is possible to first calculate and apply a partial local power scaling factor compensation, then calculate and apply the linear frequency response compensation and finally calculate and apply a final local power scaling factor. Also it is within the scope of the present invention to use more than three sub-steps in the iterative calculation steps.

REFERENCES INCORPORATED HEREIN BY REFERENCE

-   [1] BEERENDS (J. G.), STEMERDINK (J. A.): A Perceptual Speech-Quality Measure Based on a Psychoacoustic Sound Representation, J. Audio Eng. Soc., Vol. 42, No. 3, pp. 115-123, March 1994.
-   [2] BEERENDS (J. G.): Modelling Cognitive Effects that Play a Role in the Perception of Speech Quality, Speech Quality Assessment, Workshop papers, Bochum, pp. 1-9, November 1994.
-   [3] BEERENDS (J. G.): Measuring the quality of speech and music codecs, an integrated psychoacoustic approach, 98th AES Convention, pre-print No. 3945, 1995.
-   [4] HOLLIER (M. P.), HAWKSFORD (M. O.), GUARD (D. R.): Error activity and error entropy as a measure of psychoacoustic significance in the perceptual domain, IEE Proceedings - Vision, Image and Signal Processing, 141 (3), 203-208, June 1994.
-   [5] RIX (A. W.), REYNOLDS (R.), HOLLIER (M. P.): Perceptual measurement of end-to-end speech quality over audio and packet-based networks, 106th AES Convention, pre-print No. 4873, May 1999.
-   [6] HOLLIER (M. P.), HAWKSFORD (M. O.), GUARD (D. R.): Characterisation of communications systems using a speech-like test stimulus, Journal of the AES, 41 (12), 1008-1021, December 1993.
-   [7] ZWICKER (Feldtkeller): Das Ohr als Nachrichtenempfänger, S. Hirzel Verlag, Stuttgart, 1967.
-   [8] ITU-T recommendation P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs”, ITU-T 02.2001.
-   [9] BEERENDS (J. G.), HEKSTRA (A. P.), RIX (A. W.), HOLLIER (M. P.): Perceptual Evaluation of Speech Quality (PESQ) The New ITU Standard for End-to-End Speech Quality Assessment Part II - Psychoacoustic Model, J. Audio Eng. Soc., Vol. 50, no. 10, October 2002.
-   [10] European patent application EP02075973, Koninklijke KPN N.V.

1. A method for measuring the transmission quality of an audiotransmission system, an input signal (X) being entered into the system,resulting in an output signal (Y), in which both the input signal (X)and the output signal (Y) are processed, comprising the steps of:preprocessing of the input signal (X) and output signal (Y) to obtainpitch power densities (PPX_(WIRSS)(ƒ)_(n), PPY_(WIRSS)(ƒ)_(n)) for therespective signals; compensating the pitch power densities for linearfrequency response and time varying gain so as to obtain compensatedpitch power densities (PPX″_(WIRSS)(ƒ)_(n), PPY′_(WIRSS)(ƒ)_(n)),wherein the compensation of linear frequency response and time varyinggain comprises an iterative loop having at least three compensationcalculations, the calculations having a calculation of a first partialcompensation of a first type, a calculation of a compensation of asecond type, and a calculation of a second partial compensation of thefirst type, the first type of calculation and the second type ofcalculation comprising a different one of a calculation of compensationof linear frequency response and a calculation of a local power scalingfactor; and computing, in response to compensated pitch power densities(PPX″_(WIRSS)(ƒ)_(n), PPY′_(WIRSS)(ƒ)_(n)), a score (Q) indicative oftransmission quality of the system.
 2. Method according to claim 1, inwhich the iterative loop comprises a calculation of a first partiallinear frequency compensation and application of the first partiallinear frequency compensation to the pitch power density of the inputsignal (PPX_(WIRSS)(ƒ)_(n)), followed by a calculation of a local powerscaling factor and application of the local power scaling factor to thepitch power density of the output signal (PPY_(WIRSS)(ƒ)_(n)), followedby a calculation of a second partial linear frequency compensation andapplication of the linear frequency compensation to the partiallycompensated pitch power density of the input signal(PPX′_(WIRSS)(ƒ)_(n)).
3. Method according to claim 2, in which the first partial linear frequency compensation is a first estimate which is lower than a linear frequency compensation required for correct evaluation of the linear distortion.
4. Method according to claim 3, in which the first partial linear frequency compensation is a frequency dependent function.
5. Method according to claim 1, in which the iterative loop comprises a calculation of a first partial linear frequency compensation and application of the first partial linear frequency compensation to the pitch power density of the output signal (PPY_(WIRSS)(ƒ)_(n)), followed by a calculation of a local power scaling factor and application of the local power scaling factor to the pitch power density of the input signal (PPX_(WIRSS)(ƒ)_(n)), followed by a calculation of a second partial linear frequency compensation and application of the linear frequency compensation to the partially compensated pitch power density of the output signal (PPY′_(WIRSS)(ƒ)_(n)).
6. Method according to claim 5, in which the first partial linear frequency compensation is a first estimate which is lower than a linear frequency compensation required for correct evaluation of the linear distortion.
7. Method according to claim 6, in which the first partial linear frequency compensation is a frequency dependent function.
8. A software program product stored on computer readable media and comprising computer executable instructions, which when executed on a processing system, causes the processing system to perform the method recited in claim 1.
9. A system for measuring the transmission quality of an audio transmission system, an input signal (X) being entered into the system, resulting in an output signal (Y), comprising: means for preprocessing of the input signal (X) and output signal (Y) to obtain pitch power densities (PPX_(WIRSS)(ƒ)_(n), PPY_(WIRSS)(ƒ)_(n)) for the respective signals; means, responsive to the pitch power densities, for compensating linear frequency response and time varying gain so as to obtain compensated pitch power densities (PPX″_(WIRSS)(ƒ)_(n), PPY′_(WIRSS)(ƒ)_(n)), wherein the compensation comprises an iterative loop having at least three compensation calculations, the calculations having a first partial compensation of a first type, a calculation of a compensation of a second type, and a calculation of a second partial compensation of the first type, the first type of calculation and the second type of calculation comprising a different one of a calculation of compensation of linear frequency response and a calculation of a local power scaling factor; and means, responsive to the compensated pitch power densities (PPX″_(WIRSS)(ƒ)_(n), PPY′_(WIRSS)(ƒ)_(n)), for computing a score (Q) indicative of transmission quality of the system.
10. System according to claim 9, in which the iterative loop comprises a calculation of a first partial linear frequency compensation and application of the first partial linear frequency compensation to the pitch power density of the input signal (PPX_(WIRSS)(ƒ)_(n)), followed by a calculation of a local power scaling factor and application of the local power scaling factor to the pitch power density of the output signal (PPY_(WIRSS)(ƒ)_(n)), followed by a calculation of a second partial linear frequency compensation and application of the second partial linear frequency compensation to the partially compensated pitch power density of the input signal (PPX′_(WIRSS)(ƒ)_(n)).
11. System according to claim 10, in which the first partial linear frequency compensation is a first estimate which is lower than a linear frequency compensation required for correct evaluation of the linear distortion.
12. System according to claim 11, in which the first partial linear frequency compensation is a frequency dependent function.
13. System according to claim 9, in which the first partial linear frequency compensation is a first estimate which is lower than a linear frequency compensation required for correct evaluation of the linear distortion.
14. System according to claim 13, in which the iterative loop comprises a calculation of a first partial linear frequency compensation and application of the first partial linear frequency compensation to the pitch power density of the output signal (PPY_(WIRSS)(ƒ)_(n)), followed by a calculation of a local power scaling factor and application of the local power scaling factor to the pitch power density of the input signal (PPX_(WIRSS)(ƒ)_(n)), followed by a calculation of a second partial linear frequency compensation and application of the second partial linear frequency compensation to the partially compensated pitch power density of the output signal (PPY′_(WIRSS)(ƒ)_(n)).
15. System according to claim 13, in which the first partial linear frequency compensation is a frequency dependent function.
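
By way of illustration only, the three-step ordering recited in claims 2 and 10 (a first partial linear frequency compensation applied to the input, a local power scaling factor applied to the output, then a second partial linear frequency compensation applied to the partially compensated input) may be sketched as follows. The claims do not fix any formulas, so everything specific here is an assumption made for the sketch: the per-band ratio of time-summed powers, the per-frame power ratio, the exponent `alpha` that makes the first estimate lower than the full compensation, and all function names. Pitch power densities are represented as lists of frames, each frame being a list of band powers.

```python
EPS = 1e-12  # assumed guard against division by zero


def band_ratio(ppx, ppy):
    """Per-band ratio of time-summed output power to time-summed input power."""
    n_bands = len(ppx[0])
    return [
        (sum(fr[f] for fr in ppy) + EPS) / (sum(fr[f] for fr in ppx) + EPS)
        for f in range(n_bands)
    ]


def apply_band_gain(pp, gains):
    """Multiply every frame by a frequency dependent gain (one gain per band)."""
    return [[p * g for p, g in zip(frame, gains)] for frame in pp]


def frame_scale(ppx, ppy):
    """Per-frame (local) power scaling factor: input power over output power."""
    return [(sum(fx) + EPS) / (sum(fy) + EPS) for fx, fy in zip(ppx, ppy)]


def apply_frame_gain(pp, gains):
    """Multiply each frame by its own scalar gain."""
    return [[p * g for p in frame] for frame, g in zip(pp, gains)]


def compensate(ppx, ppy, alpha=0.5):
    """Return (PPX'', PPY') following the ordering of claims 2 and 10.

    alpha < 1 is an assumed way of making the first estimate lower than
    the full linear frequency compensation (compare claims 3 and 11).
    """
    # Step 1: first partial linear frequency compensation, applied to the input.
    first = [r ** alpha for r in band_ratio(ppx, ppy)]
    ppx1 = apply_band_gain(ppx, first)
    # Step 2: local power scaling factor, applied to the output.
    ppy1 = apply_frame_gain(ppy, frame_scale(ppx1, ppy))
    # Step 3: second partial linear frequency compensation, applied to the
    # partially compensated input, using the already scaled output.
    ppx2 = apply_band_gain(ppx1, band_ratio(ppx1, ppy1))
    return ppx2, ppy1
```

In this sketch the second frequency compensation is estimated only after the time varying gain has been removed from the output, so the gain fluctuations do not leak into the final frequency response estimate. The mirrored ordering of claims 5 and 14 would be obtained by swapping the roles of the input and output densities.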